
Fix Unicode token boundaries for non-ASCII “Other” scripts#1

Open
Konf wants to merge 2 commits into darkskygit:master from Konf:unicode_fix

Conversation


@Konf Konf commented Mar 10, 2026

This PR fixes tokenization for all non-ASCII scripts that go through SegmentScript::Other, e.g. Greek, Armenian, and languages written in Cyrillic.

Bug summary

  • DefaultTextNormalizer::normalize had explicit split logic for ASCII (normalize_ascii_split), but non-ASCII Other segments were processed as a single token via normalize_span.
  • This missed token boundaries in the many languages that rely on whitespace/punctuation separation.
  • As a result, a non-ASCII Other segment (for example, a phrase containing spaces and punctuation) was normalized and indexed as one token, which made single-word search queries fail.
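The failure mode above can be illustrated with a minimal sketch. The index shape and the lowercase-only normalization here are simplified assumptions for demonstration, not the project's actual indexing code:

```rust
fn main() {
    let segment = "быстрый поиск";

    // Buggy behavior: the entire Other segment is normalized as a single token.
    let indexed_buggy: Vec<String> = vec![segment.to_lowercase()];

    // A user searching for one word of the phrase finds nothing,
    // because the index only contains the full phrase as one token.
    let query = "поиск".to_string();
    assert!(!indexed_buggy.contains(&query));

    // Expected behavior: whitespace-separated words become separate tokens,
    // so a single-word query matches.
    let indexed_fixed: Vec<String> = segment
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .collect();
    assert!(indexed_fixed.contains(&query));

    println!("buggy index: {:?}", indexed_buggy);
    println!("fixed index: {:?}", indexed_fixed);
}
```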

Fix

  • This PR adds a normalize_unicode_split function that acts like normalize_ascii_split, but for Unicode strings, and wires it into SegmentScript::Other text normalization.
  • Now Other text is split into Unicode word-like spans before normalization, so terms are indexed and matched independently.
  • Four tests were also added to reproduce these issues and guard against future regressions.
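The splitting step described above can be sketched as follows. This is a minimal stand-in using only the standard library's char::is_alphanumeric; the function name unicode_word_spans and the exact boundary rules are illustrative assumptions, and the real normalize_unicode_split may use different word-boundary logic:

```rust
/// Split a string into word-like spans: maximal runs of alphanumeric
/// characters, with whitespace and punctuation treated as boundaries.
fn unicode_word_spans(text: &str) -> Vec<&str> {
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    for (i, ch) in text.char_indices() {
        if ch.is_alphanumeric() {
            // Remember where the current word began.
            if start.is_none() {
                start = Some(i);
            }
        } else if let Some(s) = start.take() {
            // Non-word character ends the current span; `i` is a valid
            // char boundary, so the slice is safe.
            spans.push(&text[s..i]);
        }
    }
    // Flush a trailing word that runs to the end of the string.
    if let Some(s) = start {
        spans.push(&text[s..]);
    }
    spans
}

fn main() {
    assert_eq!(unicode_word_spans("привет, мир!"), vec!["привет", "мир"]);
    assert_eq!(unicode_word_spans("Ένα δύο"), vec!["Ένα", "δύο"]);
    println!("{:?}", unicode_word_spans("привет, мир!"));
}
```

Each returned span would then be passed through normalization individually, mirroring what normalize_ascii_split already does for ASCII text.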

